{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# 9. Regular Expressions\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions can be used to search for specific patterns within texts. When you search using patterns rather than fixed search terms, you can search in much more advanced ways. You can search for words with specific types of characters or for words containing a specific number of characters, for instance. \n", "\n", "A regular expression typically consists of a sequence of symbols which together specify what to search for. Once defined, a regular expression can be matched against actual strings. \n", "\n", "Regular expressions can be constructed using literal characters and so-called metacharacters. The simple regular expression ‘flower’, for instance, only contains literal characters. It can be used to search for exactly these six characters, in this order. Metacharacters, by contrast, are characters with a special meaning. They represent specific types of characters, such as characters in lower case, digits, spaces or tabs. When you combine literal characters and metacharacters, you can search for patterns rather than for literal strings or keywords. \n", "\n", "## re.search()\n", "\n", "The standard installation of Python includes a useful module called `re`, which can be used to search for text fragments on the basis of regular expressions. To work with the module, you first need to import it. The module `re` contains a function named `search()`, which requires at least two arguments. The first argument is the pattern to search for, and the second argument is the string in which you want to search. If the pattern occurs in the string, the function returns a match object, which counts as true in an `if` statement. If the pattern does not occur, it returns `None`. \n", "\n", "The listing below offers an example. The regular expression, in this case, is simply a string consisting of literal characters. 
The code below tries to establish whether the string that is passed as the first argument of `re.search()` occurs in the sentence that is passed as the second argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "sentence = 'Mrs. Dalloway said she would buy the flowers herself.'\n", "\n", "# re.search() returns a match object if the pattern is found, and None otherwise\n", "if re.search('flower', sentence):\n", "    print('The pattern was found in the sentence!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Metacharacters\n", "\n", "In addition to literal characters, the following metacharacters may be used: \n", "\n", "\n", "\n", "
\n", "| Metacharacter | Description |\n",
"| --- | --- |\n",
"| `\\w` | Any alphanumeric character: all 26 letters of the Latin alphabet, both in upper case and in lower case, all digits and the underscore. For Unicode strings, Python also matches letters and digits from other scripts. |\n",
"| `\\d` | Any digit. |\n",
"| `.` | Any character, except the newline. |\n",
"| `\\s` | White space, such as a space, a tab or a newline character. |\n",
"| `[A-Z]` | Any upper case letter from A to Z. |\n",
"| `[A-Za-z]` | Any upper case or lower case letter from A to Z. |\n",
"| `[...]` | If only a limited number of characters are allowed at a specific position in a string, the characters that are allowed can be supplied in square brackets (i.e. in place of the dots). |\n",
"\n",
"| Quantifier | Description |\n",
"| --- | --- |\n",
"| `{n,m}` | The preceding pattern must occur at least n times and at most m times. |\n",
"| `{n,}` | At least n times. |\n",
"| `{n}` | Exactly n times. |\n",
"| `?` | Is the same as `{0,1}`. |\n",
"| `+` | Is the same as `{1,}`. |\n",
"| `*` | Is the same as `{0,}`. |\n",
"\n",
"| Symbol | Description |\n",
"| --- | --- |\n",
"| `\\b` | A word boundary. |\n",
"| `^` | The beginning of a string. |\n",
"| `$` | The end of a string. |